Limitations of 1-Step TD: The Case for n-Step Methods
AI029 Lesson 7

Imagine a robot navigating a vast, dark 20-segment hallway to find a charging station at the end. With 1-step TD, this robot is remarkably "near-sighted." Even after a successful run, it only updates the value of the very last tile before the charger. It would take twenty successful trips for that "scent" of reward to reach the start of the hallway. This is the Efficiency Gap: information propagates too slowly for large state spaces.
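The hallway thought-experiment can be sketched in a few lines. This is an illustrative toy (the state layout, step-size, and rewards are assumptions, not from the lesson): after one full successful trip, 1-step TD has only moved value onto the tile adjacent to the charger.

```python
# Hypothetical 20-segment hallway: states 0..19, with a +1 reward for
# stepping off state 19 onto the charger. Parameters are illustrative.
N_STATES = 20
ALPHA, GAMMA = 0.5, 1.0
V = [0.0] * N_STATES

def run_episode(V):
    """Walk straight down the hallway, applying the 1-step TD update
    V(s) <- V(s) + alpha * (R + gamma * V(s') - V(s)) at each tile."""
    for s in range(N_STATES):
        if s < N_STATES - 1:
            reward, v_next = 0.0, V[s + 1]
        else:
            reward, v_next = 1.0, 0.0  # charger reached; episode ends
        V[s] += ALPHA * (reward + GAMMA * v_next - V[s])

run_episode(V)
# After one trip, only the last tile has picked up any value:
print(V[0], V[N_STATES - 1])  # → 0.0 0.5
```

Because each update looks only one step ahead, the reward "scent" creeps back exactly one tile per successful episode, matching the twenty-trip propagation described above.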

[Slide: The Spectrum of Bootstrapping — the n-step continuum, from λ = 0 (1-step TD) to λ = 1 (Monte Carlo), with complex backups in between]

Bridging the Gap: The n-step Method

By using an n-step return, we perform a "Complex Backup" that looks multiple steps into the future before bootstrapping. This creates a continuum between two extremes:

  • High Bias, Low Variance ($\lambda = 0$): 1-step TD bootstraps entirely off the current value estimate after a single reward. It is stable but crawls toward the truth, one state at a time.
  • Zero Bias, High Variance ($\lambda = 1$): TD(1) behaves like a Monte Carlo method for an undiscounted episodic task, waiting for the entire actual outcome before updating.
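The n-step return that interpolates between these extremes is $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$. A minimal sketch (the reward lists and values below are made-up numbers for illustration):

```python
def n_step_return(rewards, v_bootstrap, gamma):
    """Compute G_{t:t+n} from rewards R_{t+1}..R_{t+n} and the
    bootstrapped estimate V(S_{t+n})."""
    g = v_bootstrap * gamma ** len(rewards)  # gamma^n * V(S_{t+n})
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r                # gamma^{k} * R_{t+k+1}
    return g

# n = 1 recovers the 1-step TD target:
print(n_step_return([0.0], 0.8, 0.9))            # ≈ 0.72
# A full episode with v_bootstrap = 0 recovers the Monte Carlo return:
print(n_step_return([1.0, 0.0, 2.0], 0.0, 0.9))  # ≈ 2.62
```

Passing more rewards before bootstrapping trades bias for variance, which is exactly the continuum the bullets above describe.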

Empirical Evidence (Figure 7.2)

When analyzed on a 19-state random walk, the data is clear: the "sweet spot" lies in the middle. Intermediate values of $n$ consistently achieve the lowest RMS error. There is a trade-off, however: as $n$ grows, the n-step return carries more variance, so the method becomes more sensitive to the step-size $\alpha$ and needs a smaller learning rate to stay stable.
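A small sketch of this kind of experiment, assuming a 19-state random walk with a +1 reward on the right exit and true values $i/20$ for state $i$ (episode counts, $\alpha$, and the choice of $n$ values below are illustrative, not the figure's exact settings):

```python
import random

random.seed(1)

N, GAMMA = 19, 1.0
TRUE_V = [i / (N + 1) for i in range(1, N + 1)]  # true values under random policy

def n_step_td(n, alpha, episodes):
    """n-step TD prediction on the random walk; returns RMS error vs TRUE_V."""
    V = [0.0] * (N + 2)  # indices 0 and N+1 are terminal states
    for _ in range(episodes):
        states, rewards = [(N + 1) // 2], [0.0]  # start in the middle; rewards[0] unused
        T, t = float('inf'), 0
        while True:
            if t < T:  # still acting: take a random step
                s = states[-1] + random.choice((-1, 1))
                states.append(s)
                rewards.append(1.0 if s == N + 1 else 0.0)
                if s == 0 or s == N + 1:
                    T = t + 1
            tau = t - n + 1  # time whose state estimate is updated now
            if tau >= 0:
                G = sum(GAMMA ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += GAMMA ** n * V[states[tau + n]]  # bootstrap
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return (sum((V[i + 1] - TRUE_V[i]) ** 2 for i in range(N)) / N) ** 0.5

for n in (1, 4, 16):
    print(n, round(n_step_td(n, alpha=0.4, episodes=10), 3))
```

Sweeping $n$ and $\alpha$ in a loop like this, averaged over many seeds, reproduces the U-shaped curves the figure describes, with intermediate $n$ at the bottom of the U.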